Optimize server-setup, by offloading large matrix multiplication to GPU by itzmeanjan · Pull Request #11 · itzmeanjan/ChalametPIR

itzmeanjan · 2025-04-05T03:22:21Z

Use vulkan compute shaders to offload large matrix multiplication and matrix transposition to GPU
(feature-gated by non-default gpu feature), for speeding up server-setup phase of ChalametPIR.

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>
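
For readers who want a reference point for the two operations being offloaded, here is a minimal CPU-side sketch of matrix multiplication and transposition. It is not the code from this PR: the `Matrix` type, its row-major layout, and the choice of `u32` elements with wrapping (mod 2^32) arithmetic are assumptions made for illustration.

```rust
/// Hypothetical row-major matrix over u32; arithmetic wraps modulo 2^32
/// (an assumption for this sketch, not taken from the PR).
#[derive(Debug, Clone, PartialEq)]
struct Matrix {
    rows: usize,
    cols: usize,
    elems: Vec<u32>,
}

impl Matrix {
    /// Builds a matrix; returns None if the element count does not match.
    fn new(rows: usize, cols: usize, elems: Vec<u32>) -> Option<Self> {
        if rows.checked_mul(cols)? == elems.len() {
            Some(Self { rows, cols, elems })
        } else {
            None
        }
    }

    /// Returns the transpose: element (r, c) moves to (c, r).
    fn transpose(&self) -> Self {
        let mut out = vec![0u32; self.elems.len()];
        for r in 0..self.rows {
            for c in 0..self.cols {
                out[c * self.rows + r] = self.elems[r * self.cols + c];
            }
        }
        Self { rows: self.cols, cols: self.rows, elems: out }
    }

    /// Naive O(n^3) product; returns None when the inner dimensions differ.
    fn mul(&self, rhs: &Self) -> Option<Self> {
        if self.cols != rhs.rows {
            return None;
        }
        let mut out = vec![0u32; self.rows * rhs.cols];
        for i in 0..self.rows {
            for k in 0..self.cols {
                let a = self.elems[i * self.cols + k];
                for j in 0..rhs.cols {
                    let idx = i * rhs.cols + j;
                    let prod = a.wrapping_mul(rhs.elems[k * rhs.cols + j]);
                    out[idx] = out[idx].wrapping_add(prod);
                }
            }
        }
        Some(Self { rows: self.rows, cols: rhs.cols, elems: out })
    }
}
```

Every output element of the product is independent of the others, which is why this kind of triple loop maps naturally onto one GPU invocation per output element.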

itzmeanjan · 2025-04-05T03:55:57Z

Server-setup cost without the gpu feature, on an Intel i7-1260P CPU:

$ cargo bench --features mutate_internal_client_state --profile optimized --bench offline_phase -q server_setup
Timer precision: 10 ns
offline_phase                                                                        fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ server_setup                                                                                    │               │               │               │         │
   ├─ 3                                                                                            │               │               │               │         │
   │  ╰─ DBConfig { db_entry_count: 65536, key_byte_len: 32, value_byte_len: 1024 }  2.522 m       │ 2.648 m       │ 2.585 m       │ 2.585 m       │ 2       │ 2
   ╰─ 4                                                                                            │               │               │               │         │
      ╰─ DBConfig { db_entry_count: 65536, key_byte_len: 32, value_byte_len: 1024 }  2.535 m       │ 2.552 m       │ 2.543 m       │ 2.543 m       │ 2       │ 2

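With the gpu feature enabled, server-setup is ~12.45x faster 🚀 (median of 2.585 min on CPU vs. 12.45 s with GPU offloading, for the first benchmarked configuration)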

$ cargo bench --features mutate_internal_client_state,gpu --profile optimized --bench offline_phase -q server_setup
Timer precision: 10 ns
offline_phase                                                                        fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ server_setup                                                                                    │               │               │               │         │
   ├─ 3                                                                                            │               │               │               │         │
   │  ╰─ DBConfig { db_entry_count: 65536, key_byte_len: 32, value_byte_len: 1024 }  12.18 s       │ 12.69 s       │ 12.45 s       │ 12.46 s       │ 25      │ 25
   ╰─ 4                                                                                            │               │               │               │         │
      ╰─ DBConfig { db_entry_count: 65536, key_byte_len: 32, value_byte_len: 1024 }  11.73 s       │ 12.24 s       │ 11.86 s       │ 11.87 s       │ 26      │ 26
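
The speedup figure can be re-derived from the median columns of the two benchmark tables above. The sketch below does that arithmetic; the function and constant names are made up for illustration, and the numbers are copied from the rows labelled 3 and 4.

```rust
/// Median server-setup time without the gpu feature, in minutes
/// (rows labelled 3 and 4 of the first table).
const CPU_MEDIAN_MIN: [f64; 2] = [2.585, 2.543];

/// Median server-setup time with the gpu feature, in seconds
/// (rows labelled 3 and 4 of the second table).
const GPU_MEDIAN_SECS: [f64; 2] = [12.45, 11.86];

/// Ratio of baseline time to optimized time, both in seconds.
fn speedup(baseline_secs: f64, optimized_secs: f64) -> f64 {
    baseline_secs / optimized_secs
}

/// Speedups for both benchmarked rows, converting minutes to seconds first.
fn median_speedups() -> [f64; 2] {
    [
        speedup(CPU_MEDIAN_MIN[0] * 60.0, GPU_MEDIAN_SECS[0]),
        speedup(CPU_MEDIAN_MIN[1] * 60.0, GPU_MEDIAN_SECS[1]),
    ]
}
```

With these inputs the first row works out to about 12.46x and the second to about 12.87x, consistent with the ~12.45x quoted above.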

itzmeanjan · 2025-04-06T06:34:59Z

I benchmarked server-setup on an AWS EC2 g6e.8xlarge instance, featuring Nvidia L40S Tensor Core GPUs.

Server-setup on CPU

Server-setup, partially offloaded to GPU

Note

Server-setup can be offloaded to the GPU by enabling the gpu feature. This feature needs the Vulkan drivers and library to be installed.
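
As a rough illustration of how a non-default Cargo feature can select between two code paths at compile time, here is a minimal sketch. The function name `server_setup_backend` is hypothetical and not part of the crate's API; only the feature name gpu comes from this PR.

```rust
/// Compiled only when the crate is built with `--features gpu`.
/// (Hypothetical function; a real GPU path would call into Vulkan here.)
#[cfg(feature = "gpu")]
fn server_setup_backend() -> &'static str {
    "gpu"
}

/// Default build: the feature is off, so the CPU path is compiled instead.
#[cfg(not(feature = "gpu"))]
fn server_setup_backend() -> &'static str {
    "cpu"
}
```

Exactly one of the two definitions exists in any given build, so callers need no runtime branching, and builds without the feature do not pull in Vulkan-related dependencies.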

itzmeanjan added 27 commits March 19, 2025 16:02

Add dependencies for new feature gpu

28a85da

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Use u32 for matrix dimensions

11e5b77

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Add compute shader for matrix-matrix multiplication

cfb9124

Shader is taken from https://gist.github.com/itzmeanjan/84613bc7595372c5e6b6c22481d42f9a
Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Setup a Vulkan device and queue so that commands can be submitted to it

4cb8d97

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Setup gpu returns a memory allocator and command buffer allocator too

4b9bac8

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Given a matrix, returns a buffer with transfer-src flag set

9c2ba00

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Add error enum for vulkan buffer creation failure

dce7da2

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Simplify return in matrix to transfer source buffer function

2b8d84c

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Add function recording Vulkan buffer to buffer data transfer command

96552dc

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Make error type more explicit

0e21934

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Add function to create empty Vulkan storage buffer

1b3c3bc

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Add function to submit transfer command buffer to queue and wait till it finishes

e526074

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Rename error enum variant to be more generic

db5aca1

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Add function for computing number of bytes required to encode matrix

8ff3965

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Matrix-matrix multiplication command submission and execution on GPU queue

9f4e0ea

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Reformat GLSL compute shader using clang-format

3d5757b

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Add matrix transpose compute shader

1cc4806

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Submit and wait for matrix transpose job to finish on GPU

98a0746

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Fix matrix transpose shader

679bc17

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Refactor function for transferring host matrix to device

3f33f81

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Maintain two different functions for host-accessible and device-local buffer creation

9b50f41

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Implement server-setup phase for gpu feature

450d7dc

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Add row-vector transposed matrix multiplication compute shader

3be9c22

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Implement server-respond function, using gpu feature

40ba459

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Change work-group size for vector-matrix multiplication shader invocation

ec4a802

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Duplicate comment for gpu feature-gated version of server-respond function

1d2ed91

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Avoid computing vector-matrix multiplication on GPU during `server-respond`

fe5ce49

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>
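
Two of the commits above concern small sizing computations: the number of bytes needed to hold a matrix in a GPU buffer, and the work-group size used when dispatching a shader. The sketch below shows one plausible way to compute both. The function names, the `u32` element type, and the parameter names are assumptions for illustration, not the PR's actual code.

```rust
use std::mem::size_of;

/// Bytes needed to store a rows x cols matrix of u32 elements
/// (element type assumed). Returns None if the size overflows u64.
fn matrix_byte_len(rows: u32, cols: u32) -> Option<u64> {
    (rows as u64)
        .checked_mul(cols as u64)?
        .checked_mul(size_of::<u32>() as u64)
}

/// Number of work-groups needed to cover `dim` elements when each
/// work-group handles `local_size` of them, rounding up so that a
/// partial trailing group is still dispatched.
/// Panics if `local_size` is zero.
fn work_group_count(dim: u32, local_size: u32) -> u32 {
    dim.div_ceil(local_size)
}
```

Because the group count is rounded up, a shader dispatched this way usually needs a bounds check so that invocations past the matrix edge do nothing.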

itzmeanjan added 2 commits April 6, 2025 11:50

Update project documentation to mention the gpu feature gate

1e391ad

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

Prepare for release v0.5.0

42a6736

Signed-off-by: Anjan Roy <hello@itzmeanjan.in>

itzmeanjan merged commit 0646d4e into main Apr 6, 2025
5 checks passed

itzmeanjan deleted the integrate-mat-mul-on-gpu branch April 6, 2025 06:55

itzmeanjan merged 29 commits into main from integrate-mat-mul-on-gpu
